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Abstract 

We focus on row sampling based approximations for matrix algorithms, in particular matrix 
multipication, sparse matrix reconstruction, and £2 regression. For A G ^rnxd points in 
d <^ m dimensions), and appropriate row-sampling probabilities, which typically depend on 
the norms of the rows of the m x d left singular matrix of A (the leverage scores), we give 
row-sampling algorithms with linear (up to polylog factors) dependence on the stable rank of A. 
This result is achieved through the application of non-commutative Bernstein bounds. 

Keywords: row-sampling; matrix multiplication; matrix reconstruction; estimating spectral 
norm; linear regression; randomized 

1 Introduction 

Matrix algorithms (eg. matrix multiplicat i on, S VD, £2 regression) are of wi despread use in man y 
application areas: dat a mining (lAzar et aLl.l200lll: recommendations systems (IDrineas et a/.l. l2002ll ; 



information retrieval (Berry et al. 



2008 



1995 



Achlioptas et al. . 2001); clustering (Drineas et al. . 2004; McSherrv . 200 ll ): mixture modeling ( Kannan et al 



Papadimitriou et al\. l200d) : we b search (jKleinbere 



1999 



Achlioptas and McSherrvl . |2005| ) : etc. Based on the importance of matrix algorithms, there 
has been consi derable research energy expe nded on breaking the 0{mcP) bound required by exact 



SVD methods ()Golub and van Loanl . 1 19961). 



Starting with a seminal result of Frieze et al. ( 19981 ). a larg e number of results using non- 
uniform sampling to speed up matrix computations have appeared (I Achlioptas and McSherrvl. 120071: 



Deshpande et a/.l.l2006l:lDeshpande and Vempalal . l2006l : lDrineas et al 



2006a 



bed 



2006 



Rudelson and Vershvnin . 



1.^^ — ^ ^- — — , — 17 ■ — — — — — 17 1 — aur ^^^j^j^^^^^^^^^j 

Magen and Zouzias j_ 20101). some of which give relat ive error guarantees ( Deshpande et al 



Drineas et al . 2006d i; Magen and Zouzias . 20 id ) 



20061 : iDeshpande and Vempala . 

Even more recently, ISarlosI (|2006l ) showed how random projections or "sketches" can be used 
to perform all these tasks efficiently, obtaining the first o[m(P) algorithms when preserving the 
identity of the rows the mselves are not important . In fact, we will find many of these techniques, 
together with those in lAilon and Chazellel (|200d ) essential to our algorithm for generating row 
samples ultimately leading to o{m(P) algorithms based on row-sampling. From now on, we focus 
on row-sampling algorithms. 

We start with the basic result of matrix multiplication. All other results more or less follow 
from here. I n an independent recent work w hich is developed along the lines o f using isoperimetric 
inequalities ( Rudelson and Vershvnin . 200?! ) to obtain matrix Chernoff bounds, Magen and Zouzias 
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show that by samphng nearly a hnear number of rows, it is possible to obtain a relative 
error approximation to matrix multiplication. Specifically, let A G j^mxcii and B S M."^^'^^. Then, 
for r = r2(/9/e^ log((ii + ^2)) (where p bounds the stable (or "soft") rank of A and B - see later), 
there is a probability distribution over X = {1, . . . , m} such that by sampling r rows i.i.d. from I, 
one can construct sketches A, B such that A B ~ A^B. Specifically, with constant probability. 



lA^B 



A^BIL < ellAIUIlBl 



The sampling distribution is relatively simple, relying only on the product of the norms of the rows 
in A and B. This result is applied to low rank matrix reconstruction and ^2-i'egression where the 
required sampling distribution needs knowledge of the SVD of A and B. 

Our basic result for matrix multiplication is very similar to this, and we arrive at it through a 
different path using a non-commutative Bernstein bound. Our sampling probabilities are different. 
In appication of our results to sparse mat rix reconstructi o n and ^2-i'egression, the rows of the left 
singular matrix make an appearance. In iMagdon-Ismaill (j20ld ). it is shown how to approximate 
these probabilities in o{m(f) time using random projections at the expense of a po ly- logarithmic 



factor in running times. Further refinements lead to an even more efficient algorithm iDrineas et al. 



As mentioned above, we must confess t hat one may p erform our matrix tasks more efficiently 
using these same random projection methods (Sarlos, 20061 ). however the resulting algorithms are 
in terms of a small number of linear combinations of all the rows. In many applications, the actual 
rows of A have some physical meaning and so methods based on a small number of the actual rows 
are of interest. 

We finally mention that Magen and ZouziasI ( 2010l ) also give a dimension independent bound 
for matrix multiplication using some stronger tools. Namely, one can get the matrix multiplication 
approximation in the spectral norm using r = ^{p / \og{p / e^)) . In practice, it is not clear which 
bound is better, since there is now an additional factor of inside the logarithm. 



1.1 Basic Notation 

Before we can state the results in concrete form, we need some preliminary conventions. In general, 
e G (0, 1) will be an error tolerance parameter; /? E (0, 1] is a parameter used to scale probabilities; 
and, c, c' > are generic constants whose value may vary even within different lines of the same 
derivation. Let ei, . . . , be the standard basis vectors in M™. Let A G M™^'^ denote an arbitrary 
matrix which represents m points in M^. In general, we might represent a matrix such as A (roman, 
uppercase) by a set of vectors ai, . . . ,am G (bold, lowercase), so that A^ = [ai a2 ... a^]; 
similarly, for a vector y, y'^ = [yi, . . . ,ym\- Note that a^ is the row of A, which we may also 
refer to by ^[t) \ similarly, we may refer to the column as A^*-*. Let rank(A) < min{m, d] be the 
rank of A; typically d and for concreteness, we will assume that rank(A) = d (all the results 
easily generalize to rank(A) < d). For matrices, we will use the spectral norm, || • ||; on occasion, 
we will use the Frobenius norm, || • ||^. For vectors, || • \\p = || • || (the standard Euclidean norm). 
The stable, or "soft" rank, p(A) = ||A|||,/||Af < rank(A). 
The singular value decomposition (SVD) of A is 

A = UaSaVI. 

where \Ja_ is an m x set of columns which are an orthonotmal basis for the column space in 
A; Sa is a d X d positive diagonal matrix of singular values, and Y is a d x d orthogonal matrix. 
We refer to the singular values of A (the diagonal entries in Sa) by iTj(A). We will call a matrix 
with orthonormal columns an orthonormal matrix; an orthogonal matrix is a square orthonormal 
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VaVI 



I 



dxd- 



It is possible to extend Ua to a full 



matrix. In particular, U^Ua = V^Va 
orthonormal basis of M'", [Ua, U^]. 

The SVD is important for a number of reasons. The projection of the columns of A onto the 
k left singular vectors with top k singular values gives the best rank-A; approximation to A in the 
spectral and Frobenius norms. The solution to the linear regression problem is also intimately 
related to the SVD. In particular, consider the following minimization problem which is minimized 
at w*: 



Z* 



mm 

w 



lAw 



It is known ( Golub and van Loan! . Il99fil l that Z 



|Ui(Ui)-y|| 



and w* = VaSX^UXy. 



Row- Sampling Matrices Our focus is algorithms based on row-sampling. A row-sampling 
matrix Q € R''^"^ samples r rows of A to form A = QA: 



Q 



A = QA 



"r^A 






_r^A 







where = At^ ej^. ; it is easy to verify that the row rJA samples the row of A and rescales it. We 
are interested in random sampling matrices where each is i.i.d. according to some distribution. 
Define a set of sampling probabilities pi, . . . ,pm, with Pi > and X^i^iP* — ^'t then rj = et / yjrpt 
with probability pt. Note that the scaling is also related to the sampling probabilities in all the 
algorithms we consider. We can write Q^Q as the sum of r independently sampled matrices. 



Q^Q 
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where rjrj is a diagonal matrix with only one non-zero diagonal entry; the t*'^ diagonal entry 
is equal to \lpt with probability pi. Thus, by construction, for any set of non-zero sampling 
probabilities, E[rjrJ] = Im.xm- Since we are averaging r independent copies, it is reasonable to 
expect a concentration around the mean, with respect to r, and so in some sense, Q^Q essentially 
behaves like the identity. 



1.2 Statement of Results 

The two main results relate to how orthonormal subspaces behave with respect to the row-sampling. 
These are discussed more thoroughly in Section [3l but we state them here summarily. 

Theorem 1 (Symmetric Orthonormal Subspace Sampling). Let U G R»^xrf orthonormal, and 
S € R'^^'^ be positive diagonal. Assume the row-sampling probabilities pt satisfy 



Pt>f3 



trace (S^) 



Then, if r > (4p(S)//3e ) In with probability at least 1 — S, 

IIS^ - SU^Q^QUSII < e||S| 



We also have an asymmetric version of Theorem [H which is actually obtained through an 
application of Theorem [1] to a composite matrix. 
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Theorem 2 (Asymmetric Orthonormal Subspace Sampling). Let W G M'^^'^i, V G ]R'"X'^2 
orthonormal, and let Si E i^c'ix'^i qj^^ g i^fi2xd2 positive diagonal matrices; let pi = p{Si). 

Consider row sampling probabilities 

Pi + P2 

Ifr> (8(pi + P2)/Pe^) In ^i^i^^, t/ien mt/i probability at least 1-5, 

IISiW^VSs - SiW^Q^QVSzll < e||Si||||S2|| 

We note that these row samphng probabi hties are not the usual product row sampling proba- 
bilities one uses for matrix multiplication as in brineas et aTl (|2006al l. Computing the probabilities 



requires knowledge of the spectral norms of Sj. Here, Sj are given diagonal matrices, so it is easy to 
compute ||Si||. In the application of these results to matrix multiplication, the spectral norm of the 
input matrices will appear. We will show how to handle this issue later. As a byproduct, we will 
give an efficient algorithm to obtain a relative error approximation to ||A|| based on row sampling 
and th e power- iteration, which improves upon Woolfe et al. ( 20081 ): Kuczyhski and Wozniakowski 



(1 19891 ). 



We now give some applications of these orthonormal subspace sampling results. 

Theorem 3 (Matrix Muhiphcation in Spectral Norm). Let A G R"^x''i and B G ]R'">«^2 /j^^g 
rescaled rows at = at/||A|| and ht = bt/||B|| respectively. Let pA (resp. ps) be the stable rank of 
A (resp. Bj. Obtain a sampling matrix Q G M'^^™ using row-sampling probabilities pt satisfying 

Then, if r > ^'^^^f^'^ In ^('^1+'^^) ^ with probability at least 1 — 5, 

IIA^B- A^B|| < e||A||||B||. 

The sampling probabilities depend on ||A||^ and ||B||^. It is possible to get a constant factor 
approximation to ||A||^ (and similarly ||B||^) with high probability. We summarize the idea here, 
the details are given in Section [71 Theorem [25l First sample A = QA according to probabilities pt = 
af/||A||^. These probabilities are easy to compute in 0{mdi). By an application of the symmetric 
subspace sampling theorem (see Theorem I17p . if r > {Ap^/^'^) In then with probability at least 
1-5, 



(l-e)||Af < HAWAII < (l + e)||Af . 



We now run r2(ln^) power iterations starting from a random isotropic vector to estimate the 
spectral norm of A^A. The efficiency is 0{mdi + pAdi/e^ In^(T))- 



i 

Theorem 4 (Sparse Row-Based Matrix Reconstruction). Let A have the SVD representation A 

IIA-Alifcll < ||A-Afc||, 
for k = 1, . . . ,d, where Hk projects onto the top k right singular vectors of A. 



USV^, and consider row-sampling probabilities pt satisfying pt > ^ujut. Then, if r > {4{d 
P)/(3e^) In with probability at least 1 — 5, 
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It is possible to obtain relative approximations to the sampling probabilities according to the 
rows of the left singlu a r matrix (the l everag e scores), but that goes beyond the scope of this work 
Maedon-Ismail toidi ): lOrineas et ail (|201C 



Theorem 5 (Relative Error £2 Regression). Let A G ^>nxd /^^^g ^/^g SVD representation A 
USV^, and let y G M"^. Let x* = A"^y be the optimal regression with residual e = y — Ax* 
y — AA+y. Assume the sampling probabilities pt satisfy 



Pt>f3 



u 



d+1 



Forr > (8(d + l)//3e2)ln^^, 
probability at least 1 — 35, 



let X = (QA)"'"Qy be the approximate regression. Then, with 



|Ax-y|| < 1 + e + e 



1 + e 



1 



I Ax* 



In addition to sampling according to we also need the residual vector 



e = y - AA+y. 

Unfortunately, we have not yet found an efficient way to get a good approximation (in some form 
of relative error) to this residual vector. 



1.3 Paper Outline 

Next we describe some probabistic tail inequalities which will be useful. We continue with the 
sampling lemmas for orthonormal matrices, followed by the applications to matrix multiplication, 
matrix reconstruction and ^2-regression. Finally, we discuss the algorithm for approximating the 
spectral norm based on sampling and the power iteration. 



2 Probabilistic Tail Inequalities 

Since all our arguments involve high probability results, our main bounding tools will be probability 
tail inequalities. First, let Xi, . . . , Xn be independent random variables with E[Xj] = and |Xj| < 
7; let Zn = ^ Y17=i -^i- Chernoff, and later Hoeffding gave the bound 

Theorem 6 dChernofj (jl952l l: iHoeffdind ^imi )). P[|Z„| > e] < 26""''/^'^'. 

If in addition one can bound the variance, K[Xf] < s^, then we have Bernstein's bound: 
Theorem 7 (iBernsteinI (|l924l ll. P[|Z„| > e] < 2e-"''/(2s^+27e/3)^ 

Note that when e < 85^/7, we can simplify the Bernstein bound to P[|Z„,| > e] < 26""*^^/^*^, 
which is considerably simpler and only involves the variance. The non-commutative versions of these 
bounds, which extend these inequalities to matrix valued random variables can also be deduced. Let 
Xi, . . . ,X„ be independent copies o f a symmetric random r natrix X, with E[X] = 0, and suppose 
that IIXII2 < 7; let Z„ = i Y17=i Xi- lAhlswede and Winter! ^2mi ) gave the fundamental extension 
of the exponentiation trick for computing Chernof f bounds of scalar random variables to matrix 
valued random variables (for a simplified proof, see Wigderson and Xiao ( 20081 )): 

P [||Z„||2 > e] < inf 2de-"^*/^|| E [e^^^^]\Q. (1) 
By standard optimization of this bound, one readily obtains the non-commutative tail inequality 
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Theorem 8 (jAhlswede and Winteil tooi )). P[||Z„||2 > e] < 2(ie-""'/^^' . 



Proof. The statement is trivial if e > 7, so assume e < 7. The lemma follows from (JT]) and the 
following sequence after setting t = e/27 < 



E [e*^/^] II2 < 1 + E > [ll(X/7)l2] < 1 + < 



(2) 



where (a) follows from E[X] = 0, the triangle inequality and || E [■]\\2 < E[|| • ly; (b) follows because 
IIX/7II2 < 1 and t < i. ■ 

(We have stated a s implified ver s ion o f the bound, without taking care to optimize the con- 
stants.) As mentioned in Gross et al. ( 20091 ) . one can obtain a non-commuting version of Bernstein's 
inequality in a similar fashion using ([T]). Assume that || EX^X||2 < s^. By adapting the standard 
Bernstein bounding argument to matrices, we have 



Lemma 9. For symmetric X, || E [e*'^/''']||2 <exp(^(e*- l-t)j . 

Proof. As in ([2]), but using submultiplicativity, we first bound || E [X^]||2 < s^7^~^: 

j dX p{X)X^u 

dx p{xy- 



E[X^]||2 



max 

l|u||=l 



max 

l|u||=l 



|X^-2u||X2X^-2u 



1X^-2^11 



max 

||w|| = l 



< 7 

= 7'-'l|E[X1||2<s^7 
To conclude, we use the triangle inequality to bound as follows: 



dX p(X)X2w 

2,i-2 



£=2 



Using Lemma [9] in ([T]) with t = ln(l -|- e^/s"^), and using (1 -|- x) ln(l + ^) — I > 2^+2/3 ' 
obtain the following result. 

Theorem 10 (Non-commutative Bernstein). P[||Z.„||2 > e] < 2(ie""'^/(2^'+2'^'/^\ 

Gross et al\ (|2009l ^ gives a simpler version of this non-commutative Bernstein inequality. If 



X S j^Q^ symmetric, then by considering 

one can get a non-symmetric verision of the non-commutative Chernoff and Bernstein bounds, 
Theorem 11 feechtJ JioO^)). P[||Z„,||2 > e] < (di + (i2)e-"^'/(2s2+27e/3)_ 



For most of our purposes, we will only need the symmetric version; again, if e < 85^/7, then we 
have the much simpler bound P[||Z.„||2 > e] < 2de~^'' . 
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3 Orthonormal Sampling Lemmas 



Let U G IR'^X'^ be an orthonormal matrix, and let S G R'^^'^ be a diagonal matrix. We are interested 
in the product US G M"*^*^; US is the matrix with columns XJ^^^Sa. Without loss of generality, 
we can assume that S is positive by flipping the signs of the appropriate columns of U. The 
row-representation of U is U^ = [ui, . . . jU^]; we consider the row sampling probabilities 

~ trace(S^) 

Since U^U = Idxd, one can verify that trace(S^) = X^t^JS^uj is the correct normalization. 
Lemma 12 (Symmetric Subspace Sampling Lemma). 

P[||S2 - SU^Q^QUSII > e||Sf] < 2d • exp 

< 2d ■ exp 



2{p/l3-K-^ + e{p/f3-K-^)/3)J ' 



4p 

where p is the numerical (stable) rank ofS, p{S) = ||S|||,/||S|p, and «;(S) = o"niax(S)/o"jnin(S) is the 
condition number. 

Remarks. The stable rank p < d measures the effective dimension of the matrix. The condition 
number k > 1, hence the simpler version of the bound, which is valid for e < 3. It immediately 
follows that if r > (4p//3e^) In then with probability at least 1 — 5, 

||S2 - SU^Q^QUSII < e||Sf 

An important special case is when S = Idxdi in which case p = d, k = 1 and ||S|| = 1. 

Corollary 13. For sampling probabilities pt > ^u^u^, 

-/3re2 



P[||I - U^Q^QUII >e]<2d- exp 



4(d - /3) 



Proof, (of Lemma ll2p Note that U^Q^QU = ^ Yli=i ^W^ti/Pu^ where ti G [l,m] is chosen accord- 
ing to the probability . . It follows that 

S2 _ SU^Q^QUS = -ys^- —Suty, S = - y Xi, 

where Xj are independent copies of a matrix-random variable X ~ S^ — Suu^S/p. We prove the 
following three claims: 

(i) E [X] = 0; 

(u) ||X|| < ||Sf(p//5-K"2); 

(ii) ||EX^X|| < \\Sf{p/p-K-^). 

1 1 2 

The Lemma follows from the non-commutative Bernstein bound with e replaced by e||S|| . To prove 
(i), note that E[X] = S^ - SE [uu^/p]S = S^ - S (E™ ^ utuj) S = 0, because ^™ ^ utuj = U^U = 

Idxd- 



7 



To prove (m), let z be an arbitrary unit vector and consider 

z^Xz = z^S^z - -(z^Su)2. 



It follows that z^Xz < ||S||'^. To get a lower bound, we use p > /3u^S^u/trace(S'^): 



z-Xz > ,Ts2^_trace(S2)(z-Su^2 



2„ ' 



P uTS^u 

> ||S||2 trace(S2^ 



(a) follows because: by definition of fTmim the minimum of the first term is cr^:^^; and, by Cauchy- 
Schwarz, (z'^Su)^ < (zTz)(uTS2u). Since /3 < 1, p//3 - > 1 (for d > 1), and so jz^Xz] < 
||S||^ [p/f3 — from which (ii) follows. 

To prove (iii), first note that 

E[XTX] = S^-S3E[uuT/p]S-SE[uuT/p]S3 + SE[uuTS2uu7p2]S, 

(m ^ \ 
J2 ^u,u-S2u,uJ j S - 

(a) follows because E[uu^/j>] = I. Thus, for an arbitrary unit z, we have 

m ^ 

z^EfX^Xjz = ^-(zTSuiU^Sz)u^SV-z^SV, 

t=i ^* 

W trace(S^) ,\ 
< ^ z S I 2^utu, 1 S 

4 / trace(S^) z^S^z^ _ z^SV \ 

V /3||Sf ||S||^ J 
4 / trace(S^) _ 

V /3||S|p IISIIV' 



< 



Tc4 T 

z — z b z , 

2"\ r,TQ2 T r,TQ4 



(a) follows from pt > /3u^S^Uf/trace(S^); (b) follows from U^U = J2t^i'^t'^t — ^dxd- Thus, 
|z^ E [X^X]z| < ||Sf (p//3 - K"^), from which (m) follows. ■ 

For the general case, consider two orthonormal matrices W € M"^^'^i, V € M"'^'^^, and two 
positive diagonal matrices Si G j^'^iX'^i and S2 G M'^^^'^^. We consider the product SiW^VS2, which 
is approximated by the sampled product SiW^Q^QVS2. Consider the sampling probabilities 

KSfu,)V^(v?Siv,)V^ KSfu,)V^KSiv,)V^ 
^*-^E;:LiKS?u,)V.KSiv,)V2-^ A (S?)trace(Si) ' 



where the last inequality follows from Cauchy-Schwarz. Since ||A||^ = /9(A)||A|| > ||A||, any 
bound for the Frobenius norm can be converted into a bound for the spectral norm. Using the 
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Frobenius norm bounds in iDrineas et all (j2006al ) (using a simplified form for the bound), one 
immediately has: 

P [||SiW^VS2 - SiW^Q^QVSsll > e||Si||||S2||] < exp ( ^^^) , (4) 



16piP2 

where pi = p(Si) and p2 = p{S2)- Alternatively, if r > (16pip2//3^e^) In ^, then 

||SiW^VS2 - SiW^Q^QVS2|| < e||Si||||S2||. 

The dependence on the stable ranks and /3 is quadratic. Applying this bound to the situation 
in Lemma [12] would give an inferior bound. The intuition behind the improvement is that the 
sampling is isotropic, and so will not favor any particular direction. One can therefore guess that 
all the singular values are approximately equal and so the Frobenius norm bound on the spectral 
norm will be loose by a factor of y^; and, indeed this is what comes out in the closer analysis. As 
a application of Lemma [T2l we can get a better result for the asymmetric case. 

Lemma 14. Let W G R"''"^\ V G M'"><'^2 be orthonormal, and let Si G R^ix^i and S2 G M'^^xda 
be two positive diagonal matrices. Consider row sampling probabilities 



Pt > 13- 



Pi + P2 

If r > (8(pi + p2 ) / /3e^ ) In ^('^i ) ^ then with probability at least 1 — 5, 

||SiW^VS2 - SiW^Q^QVSall < e||Si||||S2|| 
For the special case that Si = Idixdi and 82 = 1^2x^21 sampling probabilities simplify to 



wTwt + vTvt 



di + d2 

Corollary 15. If r > {8{di + d2)//5e2) In ^^^i^ , then with probability at least 1 - 6, 

IIW^V- W^Q^QVII <e. 

Proof, (of Lemma I14p By homogeneity, we can without loss of generality assume that ||Si| 
IIS2II = 1, and let0 Z = [WSi VS2]. An elementary lemma which we will find useful is 

Lemma 16. For any matrix A = [Ai A2], 



max{||Ai||,||A2||} < ||A|| < ^ ||Aif + HAaf . 

The left inequality is saturated when Ai and A2 are orthogonal (A^A2 = 0), and the right 
inequality is saturated when Ai = A2. By repeatedly applying Lemma[T6]one can see that ||A|| is 
at least the spectral norm of any submatrix. Introduce the SVD of Z, 

Z = [WSi VS2] = usv^. 



""^The general case would have been Z — j^^i^WSi pjjy VS2j • 
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We now use the row sampling probabilities according to US from ([3]), 



Pt>p 



trace(S^ 



We may interpret the sampling probabilities as follows. Let zt be a row of Z, the concatenation of 
two rows in WSi and VS2: zj = [wJSi VJS2]. We also have that zj = u^SVg. Hence, 

uJS^Uf = u^SVzVzSut = zfzt = wJSiWt + v^S2Vt. 

These are exactly the probabilities as claimed in the statement of the lemma (modulo the rescaling) . 
Applying Lemma [12) if r > (4p//3e^) In ^ then with probability at least 1 — 6, 



IIS^ - SU^Q^QUSII < e||Sf < ||Sif + ||S2f = eV2, 
where the second inequality follows from Lemma [TU] Since ZV = US, 

IIZ^Z - Z^Q^QZII = ||S2 - SU^Q^QUS||. 
Further, by the construction of Z, 



Z^Z- Z^Q^QZ 



Sf - SiW^Q^QWSi SiW^VS2 - SiW^Q^QVSs 
S2V^WSi - SaV^Q^QWSi S^ - S2V^Q^QVS2 



By Lemmatti ||SiWTVS2 - SiW^Q^QVSsH < ||Z^Z - Z^Q^QZ||, and so: 

||SiW^VS2 - SiW^Q^QVS2|| < eV2. 

Observe that trace(S^) = ||Z||^ = trace(Sf) +trace(S2); further, since ||S|| > max{||Si||, IIS2II}, we 
have that 

trace(S^) trace(Sf ) + trace(S2) trace(Sf) trace(S2) 

/'(S) = ||g||2 = ^^1^ ^ ||g^||2 + ||g^||2 = Pi + /'2- 

Since rank(U) < di + d2, it suffices that r > (4(/9i + P2)/ (3e^) In obtain error after 

rescaling e' = e\/2, we have the result. ■ 

4 Sampling for Matrix Multiplication 

We obtain results for matrix multiplication directly from Lemmas [12] and [HI First we consider the 
symmetric case, then the asymmetric case. Let A S j^mxdi -g ^ ]^mxd2_ interested 
in conditions on the sampling matrix Q G '^^xm ^y^^^ that A^A ~ A A and A^B ~ A B, where 
A = QA and B = QB. Using the SVD of A, 

||A^A-A^Q^QA|| = llVASAUiUASAVi-VASAUlQ^QUASAVill, 
= ||Sa - SaUaQ'^QUaSaII- 

We may now directly apply Lemma [T2l with respect to the appropriate sampling probabilities. One 
can verify that the sampling probabilities in Lemma [12] are proportional to the squared norms of 
the rows of A. 
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Theorem 17. Let A e W^^^^ have rows at Obtain a sampling matrix Q G j^t^xm using row- 
sampling probabilities 

T ^ 

Pt>f3- 



Then, if r > In wii/i probability at least 1 — S, 

IIA^A- A^A|| < e||A|| 
Similarly, using the SVDs of A and B, 



|A^B-A^Q^QB|| = IIVaSaUIUbSbV^-VaSaUIQ^QUbSbV^ 
= IISaUXUbSb-SaUXQ^QUbSbII. 



We may now directly apply Lemma [HI with respect to the appropriate sampling probabilities. One 
can verify that the sampling probabilities in Lemma [HI are proportional to the sum of the rescaled 
squared norms of the rows of A and B. 

Theorem 18. Let A G M^x'^i and B e W^""^^ , have rescaled rows aj = at/||A|| and ht = bt/||B|| 
respectively. Obtain a sampling matrix Q G ]^''X"^ using row-sampling probabilities 

> aja-t + b^bt ^ a^a^ + b^b^ 
ET=i^l^t + hJht PA + PB 

Then, ifr> 8(pa+pb) 2(di+d2) ^ ^.^^ probability at least 1 - 6, 



lA^B- A^BII < ellAllllBl 



5 Sparse Row Based Matrix Representation 



Given a matrix A = USV^ E 



pmxd 



, the top k singular vectors, corresponding to the top k singular 



values give the best rank k reconstruction of A. Specifically, let A^ = U^S/jV^, where Ufc G 
Sfe e M'^^*^ and Vfe G 



pmx fc 



k. 



Then, ||A - Afc|| < ||A 



Ufc and Vfc correspond to the top-k left and right singular vectors. 
X|| where X G j^'i^xc* ranges over all rank-/c matrices. As usual, let 



A = QA be the sampled, rescaled rows of A, with A = USV , and consider the top-A: right 
singular vectors V^. Let Ilk be the projection onto this top-fc right singular space, and consider 
the rank k approximation to A obtained by projecting onto this space: A^ = AITfc. The following 
lemma is useful for showing that A^ is almost (up to additive error) as good an approximation to 
A as one can get. 



Lemma 19 ( Drineas et al. ( 2006bl ). Rudelson and Vershynin ( 2007 )) 



lA- A, 



< ||A - Afcf + 2||A^A - A^A|| < (||A - A^H + V2\\PJ K - A^A||^^V- 



Proof. The proof follows using standard arguments and an application of a perturbation theory 
result due to Weyl for bounding the change in any singular value upon hermitian perturbation of 
a hermitian matrix. ■ 
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Therefore, if we can approximate the matrix product A^A, we immediately get a good recon- 
struction for every k. The appropriate sampling probabilities from the previous section are 

Pt > Pt^- 

In this case, if r > {4p/(3e^) In then with probability at least 1 — d, 

||A- Afcf < ||A- Afcf + 2e||Af . 

The sampling probabilities are easy to compute and sampling can be accomplished in one pass if 
the matrix is stored row-by-row. 

To get a relative error result, we need a more carefully constructed set of non- uniform sampling 
probabilities. The problem here becomes apparent if A has rank k. In this case we have no hope of 
a relative error approximation unless we preserve the rank during sampling. To do so, we need to 
sample according to the actual singular vectors in U, not according to A; this is because sampling 
according to A can give especially large weight to a few of the large singular value directions, 
ignoring the small singular value directions and hence not preserving rank. By sampling according 
to U, we essentially put equal weight on all singular directions. To approximate U well, we need 
sampling probabilities 

Pt > -^'^t^t- 

Then, from Corollary [l3l if r > (4(d - /5)//5e^) In ^, with probability at least 1-5, 

||I_U^QTQu|| <e. 

Since ||U|| = 1, it also follows that 

IIUU^-UU^Q^QUU^II < e. 
This result is useful because of the following lemma. 

Lemma 20 dSpielman and Srivastaval ^200^ )). If ||UU^ - UU'^Q^QUU^H < e, then for every 



(1 - e)x^A^Ax < x^A^Ax < (1 + e)x^A^Ax. 
Proof We give a sketch of the proof from Spielman and Srivastava ( 20081 ). We let x 7^ range 



over col(U). Since col(U) = col(A), x G col(U) if and only if for some y G M'^, x = Ay. Since 
rank(A) = d, Ay ^ <;=^ y 7^ 0. Also note that UU^A = A, since UU^ is a projection operator 
onto the column space of U, which is the same as the column space of A. The following sequence 
establishes the lemma. 

Ix^UU^x-x^UU^Q^QUU^xl 
UU — UU Q QUU = sup , 

XT^O X^X 

ly^A^UU^Ay - y^A^UU^Q^QUU^Ay| 
" Ayft y^A-Ay 

|yTATAy_yTATQTQAy| 

= sup —T^ , 

Ay^o y^A^Ay 

ly^A^Ay - y^A^Ayl 

= sup —J- , 

y^o y^A Ay 

The lemma now follows because ||UU^ - UU^Q^QUU^|| < e. ■ 
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Via the Cour ant-Fischer characterization Golub and Van Loan ( 19831 ) of the singular values, it 
is immediate from Lemma [20l that the singular value spectrum is also preserved : 

(1 - e)cT,(ATA) < aiiA'A) < (1 + e)ai(A^A). (5) 

Lemma [20] along with ([5]) will allow us to prove the relative approximation result. 

Theorem 21. If pt > §u^ut and r > (4(d - /3)//3e2) In ^, then, for k = 1, . . . ,d, 



IIA-AEfcll < (^^j l|A-Afc||, 
where Hk projects onto the top k right singular vectors of A. 

1 /2 

Remarks For e < ^, ( ^rrf ) < 1 + 2e. Computing the probabilities pt involves knowing Uf 



which means one has to perform an SVD, in which case, one could use A^; it seems like overkill 
to compute A^ in order to approximate A^. We discuss approximate sampling schemes later, in 
Section [71 

Proof. Let ||x|| = 1. The following sequence establishes the result. 

||A(I-nfc)||^ = sup ||Ax||^ = sup x'^A'^Ax, 
xeker(nfc) xeker(nfe) 

< sup x^A^Ax, 

^ ~ ^ xeker{nfe) 

1 ~T ~ 

-(jfc+i(A A), 



1 - e 

< i±lcTfc+i(A^A) = i±^||A-Afcf. 



6 £2 Linear Regression with Relative Error Bounds 

A linear regression is represented by a real data matrix A G ^"m-xd. ^j-^^^h represents m points in 
W^, and a target vector y G M™. Traditionally, d (severly over constrained regression). The 

goal is to find a regression vector x* G which minimizes the £2 fit error (least squares regression) 

m 

8{-k) = ||Ax-y||2 = J^(a^x-2/t)2, 

t=i 

We assume such an optimal x* exists (it may not be unique unless A has full column rank), and is 
given by x* = A'''y, where denotes the More-Penrose pseudo-inverse; this problem can be solved 
in 0{md'^). Through row-sampling, it is possible to construct x, an approximation to the optimal 
regression weights x*, which is a relative error approximation to optimal, 

£{±) < (l + e)£:(x*). 
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As usual, let A = UaSaVX- Then A+ = VaS^^UX, and so X* = VS ^U^y. The predictions are 
y* = Ax* = UAU^y, which is the projection of y onto the column space of A. We define the 
residual e = y-y* = y- Ax* = (I - UAUX)y, so 

y = UAUly + e. (6) 

We will construct A and y by sampling rows: 

[A,y] = Q[A,y], 

and solve the linear regression problem on (A,y) to obtain x = A^y- Foi' ^ (0) g]) "^iH use 
the sampling probabilities 

to construct A and y. There are three parts to these sampling probabilities. The first part allows 
us to reconstruct A well from A; the second allows us to reconstruct A^e; and, the third allows us 
to reconstruct e. 

Note that A = QUaSaVa^; if QUa consisted of orthonormal columns, then this would be 
the SVD of A. Indeed, this is approximately so, as we will soon see. Let the SVD of A be 
A = U^S^V^^. Let U = QUa. Since pt > pnj/d, it follows from Corollary US that if r > 2^, 
for e G (0, 1), then, with high probability, 

||I-U^U|| <e. 

Since the eigenvalues of I — U^tj are given by 1 — afiV), it follows that 

1 - e < < 1 + e. 

So all the singular values of Ua are preserved after sampling. Essentially, it suffices to sample 
r = 0{d\xid/ e^) rows to preserve the entire spectrum of Ua- By choosing (say) e = ^, the 
rank of Ua is preserved with high probability, since all the singular values are bigger than ^. 
Thus, with high probability, rank(A) = rank(U^) = rank(QUA) = rank(UA) = rank(A). Since 
QUa has full rank, Sqjj^ is defined, and Squa — ^qUa ^ diagonal matrix whose diagonals are 
{af{\]) — l)/crj(U); thus, ||Squa — ^QUaII2 — ^- This allows us to quantify the degree to 

which QUa is orthonormal, because 



||(QUA)+-(QUAf|l2 = I|VqUaSq[j,UqUa"-VqUaSqUaU" 



QUaII2 



Squa - SquaIL < 



'2 



Finally, we can get a convenient form for A^ = (QA)"*", because QA = QUaSaVa has full rank, 
and so QUa = UquaSquaVqua has full rank (and hence is the product of full rank matrices). 
Thus, 



(QA)+ = (UquaSquaV^UaSaV1)+, 

= Va(SqUaVquaSa)"^Uqua, 

= VaSX^VqUaSqUa^QUa' 

= VASAnQUA)+, 



We summarize all this information in the next lemma. 
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Lemma 22. If r > {Ad/(3e ) In with probability at least 1 — 6, all of the following hold: 

rank(A) = rank(U^) = rank(QUA) = rank(UA) = rank(A); (8) 
l|SQUA-SQ[jJ|2<e/Vr^; (9) 
||(QUA)+-(QUAni2<e/\/r^; (10) 

(QA)+ = VaSa1(QUa)+. (11) 
In Lemma 1221 we h ave simplified the constant to 4; this is a strengthened form of Lemma 4.1 in 



DrmeasetadB l: in parUcular, the dependence on d is near-linear 



Remember that x = A y; we now bound WAx 



basically follows the line of reasoning in brineas et oZl (|2006dl ) . Under the conditions of Lemma 
with probability at least 1 — 5, 



2 



We only sketch the derivation which 



(a) 
(J 



|Ax-y|| = ||AA^y-y|| = || A(QA)+Qy - y|| 
||UA(QUA)+Qy-y|| 

||UA(QUA)+Q(UAUiy + e) - UAU^y - e|| 

||UA(QUA)+Qe-e|| 

= ||Ua((QUa)+ - (QUA)^)Qe + UA(QUA)^Qe - e|| 

(d) 

< ||(QUa)+ - (QUA)^IIIIQell + llUlQ^Qell + ||e|| 

< ^^IIQe|l + l|UlQ^Qe|| + ||e||. 
VI - e 

(a) follows from Lemma [22] (b) follows from ([6]); (c) follows Lemma [22l because QUa has full 
rank and so (QUa)~'"QUa = 1^; (d) follows from the triangle inequality and sub-multiplicativity 
using ||Ua|| = 1; finally, (e) follows from Lemma [22l We now see the rationale for the complicated 
sampling probabilities. Since pt > e^/e^e, for r large enough, by Theorem [TTl ||Qe||^ < ||e||^(l + e). 
Similarly, since U^e = 0, ||U\Q^Qe|| = ||U^e — U^Q^Qe|| ; so, we can apply Lemma [H] with 
Si = Irf, V = e/||e|| and S2 = ||e||. According to Lemma [Til if Pt > /3(u^ + el/e^e)/{d + 1), then 
if r is large enough, ||U^Q^Qe|| < e||e||. Since these are all probabilistic statements, we need to 
apply the union bound to ensure that all of them hold. Ultimately, we have the claimed result: 

Theorem 23. For sampling probabilities satisfying and for r > {8{d + l)//3e^) In let 
X = (QA)+Qy be the approximate regression. Then, with probability at least 1 — 36, 



l|Ax-y|| < |^l + e + e^i±^j ||Ax*-y||, 
where x* = A'''y is the optimal regression. 

Remarks For the proof of the theorem, we observe that any transformation matrix Q satisfying 
the following three properties with high probability will do: 

mi - U^Q^QUII < e; {ii)\\Qe\\ < (1 + e)||e||; (m)||U^Q^Qe|| < e||e||. 
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7 Estimating the Spectral Norm 



The row-norm based sampling is relatively straightforward for the symmetric product. For the 
asymmetric product, A^B, we need probabilities 



Pt > ^ • (12) 

PA + PB 

To get these probabilities, we need ||A|| and ||B||; since we can compute the exact product in 
0{mdid2), a practically useful algorithm would need to estimate ||A|| and ||B|| efficiently. Suppose 
we had estimates Aa,Ab which satisfy: 

(1 - 6)||Af < Ai < (1 + 6)||Af ; (1 - 6)||Bf < A| < (1 + 6)||Bf . 

We can construct probabilities satisfying the desired property with /3 = (1 — e)/(l + e). 



Pt 



|A||?./Ai + ||B||^/A2 



> 



B 



(l+6)||A||^ « ' (l+e)||A||- 



|A||^/(l-e)||Af + ||B||^/(l-6)||Af 



l + ey PA + PB 

One practical way to obtain ||A|p is using the power iteration. Given an arbitrary unit vector xq, 
for n > 1, let x„ = A^Ax„_i/||A'^Ax„_i||. Note that multiplying by A'^A can be done in 0{2mdi) 
operations. Since x„ is a unit vector, ||A'^Ax„|| < ||A||^. We now get a lower bound. Let xq be 
a random isotropic vector constructed using di independent standard Normal variates zi , . . . , ; 

so Xq = [zi, . . . , 2rfJ/ \J Zi + ■ ■ ■ + zji^^. Let A^ = ||A^Ax„|| be an estimate for ||A|p after n power 
iterations. 

Lemma 24. For some constant c < {"^ + 2)"^, with probability at least 1 — 8, 

IIAlP 



4+^-2 



Remarks n > clog -j- gives the desired constant factor approximation. Since each power iteration 
takes 0{mdi) time, and we run O(log^) power iterations, in 0{mdilog^) time, we obtain a 
sufficiently good estimate for ||A|| (and similarly for ||B||). 

Proof. Assume that xq = YliLi Q^iVj, where Vj are the eigenvectors of A^A with corresponding 
eigenvalues af > ■■■ > cr^^. Note ||A||^ = af. If fi^^ > af/2, then it trivially follows that 
||A^Ax„|| > af/2 for any n, so assume that o"^^ < (T^/2. We can thus partition the singular values 
into those at least crf/2 and those which are smaller; the latter set is non-empty. So assume for 
some k < di, al > o\l2 and ol^^ < o\l2. Since x„ = Y^i^^i^T^" l^t^^l^^t^f^ ^ we therefore 
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have: 



lA^Ax, 



Edi 2 ■ 



An 

i 



> 



Ek 
i=l «^ 



2 4{n+l) 



1=1 



2 4(n+l) 



(a) 
> 



> 



4 + 2-2"/Eti«?(^iM)'("+') 



0"f 



4 + 2-2n/Q,2 



(a) follows because for i > k + 1, af < crf/2; for i < /c, af/cr? < 4; and E 



A>k+1 



a. 



<E.>i 



a,- 



1. 



(b) foUows because E^=i ^^li'^i/^i)'^^^^^^ — "^i- The theorem will now follow if we show that with 
probability at least 1 — c6^^^, af > 5/d. It is clear that E[a^] = 1/d from isotropy. Without loss 
of generality, assume vi is aligned with the zi axis. So a\ = z1/ zf {zi, . . . ,Zd are independent 
standard normals). For 5 < I, we estimate P[af > 6/d\ as follows: 



zf J- 



zi > 



zl> 



d-6 



E 

i>2 



> p 


i>2 


(a) 

= P 


2 ^2 


> P 


\l>6 + 52/3] • I 



d-l 



In (a) we compute the probability that a Xi random variable exceeds a multiple of an independent 
Xd-i random variable, which follows from the definition of the distribution as a sum of squares of 
independent standard normals, (b) follows from independence and because one particular realiza- 
tion of the event in (a) is when Xi > and 6Xd-i/id—^) < 6+5'^/^. Since E[x'^_^/{d- 1)] = 1, 
and Var[x'^_i/{d — 1)] =2/{d — 1), by Chebyshev's inequality, 

1 9AV3 



d-r 



> 1 



d-l 



From the definition of the Xi distribution, we can bound ¥[xl < S + 6^/^], 



2i/2r(l/2) 



du ^-l/2g-«/2 < 



(5 + 52/3)1/2^ 
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and so 



' 2 ^' 



We now consider the samphng based approach to estimate the spectral norm. Pre-sample the 
rows of A using probabilities proportional to the row norms to construct A. We know that if 
r > (4pA//3e^)ln^, then 

||A^A- A^A|| < e||Af . 
It follows that we have a e-approximation to the spectral norm from 

HAWAII = ||A^A- A^A + A^A|| < (l + e)||Af; 
HAWAII = ||A^A- A^A + A^A|| < e||Af + ||A^A||. 

Thus, (1 — e)||A|p < ||A^A|| < (1 + e)||A|p. Along this route, one must first sample r rows, and 
then approximate the spectral norm of the resulting A. We may now combine with the power 
iteration on A A to get a constant factor approximation efficiently (or we may compute exactly in 
0{rdi)). Specifically, set e = in which case, with high probability, gllAjl < ||A A|| < 2IIAII • 
Now, choose the number of power iterations n > n* , where ^ = 2"*. In this case, after n power 
iterations, we have an estimate which is at least --^||A|| from Lemma [24| which proves Theoreml251 

Theorem 25. With r > In the spectral norm estimate af obtained after cln ^ power 

7 T 



iterations on A A starting from an isotropic random vector satisfies 

^ i|A||2<a?<-||A|p. 



2^5 " ^ " 2' 

Further, the estimate d\ can be computed in 0{mdi + pAdi/e'^\v?{^)). 

As mentioned at the begining of this section, constant factor approximations to the spectral 
norms of the relevant matrices is enough to obtain probabilities satisfying ()12p for some constant 
/3. 
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